2024-03-28
Here, we learn about nested (hierarchical) data on faculty salaries. For each subject (faculty member) in the data, we have information on their salary, department and years of experience.
We expect that salary (and the relationship between salary and years of experience) may be different depending on department, and every subject is in exactly one department.
We’ll visit http://mfviz.com/hierarchical-models/ now to learn a bit about hierarchical models.
We’ll focus today on approaches using the lme4 package, which can be used both for linear mixed models and for generalized linear mixed models.
# Parameters for generating faculty salary data
departments <- c('sociology', 'biology', 'english',
'informatics', 'statistics')
base.salaries <- c(40000, 50000, 60000, 70000, 80000)
annual.raises <- c(2000, 500, 500, 1700, 500)
faculty.per.dept <- 25
total.faculty <- faculty.per.dept * length(departments)

# Generate tibble of faculty and (random) years of experience
set.seed(432)
ids <- 1:total.faculty
department <- rep(departments, faculty.per.dept)
experience <- floor(runif(total.faculty, 0, 10))
bases <- rep(base.salaries, faculty.per.dept) *
runif(total.faculty, .9, 1.1) # noise
raises <- rep(annual.raises, faculty.per.dept) *
runif(total.faculty, .9, 1.1) # noise
facsal <- tibble(ids, department, bases, experience, raises)
# Generate salaries (base + experience * raise)
facsal <- facsal |>
mutate(salary = bases + experience * raises,
         department = factor(department))

facsal data

# A tibble: 125 × 6
ids department bases experience raises salary
<int> <fct> <dbl> <dbl> <dbl> <dbl>
1 1 sociology 39438. 2 1861. 43160.
2 2 biology 46046. 0 493. 46046.
3 3 english 63656. 9 458. 67776.
4 4 informatics 67330. 1 1573. 68903.
5 5 statistics 84116. 7 545. 87930.
6 6 sociology 41626. 9 1871. 58463.
7 7 biology 54687. 7 521. 58335.
8 8 english 64477. 2 468. 65413.
9 9 informatics 64456. 6 1722. 74787.
10 10 statistics 87841. 9 548. 92774.
# ℹ 115 more rows
m0 <- lm(salary ~ experience, data = facsal)
tidy(m0, conf.int = TRUE) |>
select(term, estimate, std.error,
conf.low, conf.high) |>
gt() |> fmt_number(decimals = 2) |>
  tab_options(table.font.size = 20)

| term | estimate | std.error | conf.low | conf.high |
|---|---|---|---|---|
| (Intercept) | 56,100.96 | 2,285.57 | 51,576.81 | 60,625.10 |
| experience | 1,772.01 | 412.96 | 954.58 | 2,589.44 |
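The plots that follow use a simple_model_preds column that isn’t created in the code shown above. A minimal sketch of how it could be built from m0 (the column name is taken from the plotting code):

```r
# Sketch: store m0's fitted values for plotting, assuming m0 is fit as above
facsal <- facsal |>
  mutate(simple_model_preds = predict(m0))
```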
m0 predictions and the data

m0 predictions with Department indicators

m0 predictions and faceted results by Department

ggplot(data=facsal, aes(x=experience,
y=simple_model_preds)) +
geom_line() +
geom_point(aes(x=experience, y=salary,
group = department, colour = department)) +
labs(x="Experience",y="Salary (in $)",
title = "Linear Model Ignoring Department") +
guides(color = "none") +
scale_color_discrete('Department') +
  facet_wrap(~ department)

m0 Residuals by Department

facsal <- facsal |>
mutate(simple_model_resids = salary - simple_model_preds)
ggplot(data=facsal, aes(x=department,
y=simple_model_resids)) +
geom_boxplot() +
geom_hline(yintercept = 0, col = "red") +
  labs(x="Department", y="Residuals from Model m0",
       title = "Residuals from Linear Model, by Department")

Linear mixed model fit by REML ['lmerMod']
Formula: salary ~ experience + (1 | department)
Data: facsal
REML criterion at convergence: 2428.544
Random effects:
Groups Name Std.Dev.
department (Intercept) 14728
Residual 4069
Number of obs: 125, groups: department, 5
Fixed Effects:
(Intercept) experience
59056 1138
This is the varying intercept model.
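The call that fits m1 isn’t shown above; based on the Formula line in the output, it would look something like this (the random_intercept_preds name is taken from the plotting code that follows):

```r
library(lme4)

# Varying intercept model: each department gets its own intercept
m1 <- lmer(salary ~ experience + (1 | department), data = facsal)

# Store predictions (fixed + random effects) for plotting
facsal <- facsal |>
  mutate(random_intercept_preds = predict(m1))
```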
tidy(m1, conf.int = TRUE) |>
select(-std.error, -statistic) |>
gt() |> fmt_number(decimals = 1) |>
  tab_options(table.font.size = 20)

| effect | group | term | estimate | conf.low | conf.high |
|---|---|---|---|---|---|
| fixed | NA | (Intercept) | 59,056.3 | 46,075.5 | 72,037.2 |
| fixed | NA | experience | 1,138.4 | 890.1 | 1,386.6 |
| ran_pars | department | sd__(Intercept) | 14,728.0 | NA | NA |
| ran_pars | Residual | sd__Observation | 4,068.9 | NA | NA |
m1 predictions

m1 predictions without the data

m1 predictions and the data

ggplot(data=facsal, aes(x=experience,
y=random_intercept_preds,
group = department,
col = department)) +
geom_line() +
geom_point(aes(x=experience, y=salary,
group = department, colour = department)) +
labs(x="Experience",y="Salary (in $)",
title = "Varying Intercept Salary Prediction") +
  scale_color_discrete('Department')

m1 predictions and the data, faceted by Department

ggplot(data=facsal, aes(x=experience,
y=random_intercept_preds,
group = department,
col = department)) +
geom_line() +
geom_point(aes(x=experience, y=salary,
group = department, colour = department)) +
labs(x="Experience",y="Salary (in $)",
title = "Varying Intercept Salary Prediction") +
scale_color_discrete('Department') +
  facet_wrap(~ department)

m1 Residuals by Department

facsal <- facsal |>
mutate(random_intercept_resids =
salary - random_intercept_preds)
ggplot(data=facsal, aes(x=department,
y=random_intercept_resids)) +
geom_boxplot() +
geom_hline(yintercept = 0, col = "red") +
  labs(x="Department", y="Residuals from Model m1",
       title = "Residuals of Varying Intercepts Model")

Linear mixed model fit by REML ['lmerMod']
Formula: salary ~ experience + (0 + experience | department)
Data: facsal
REML criterion at convergence: 2626.859
Random effects:
Groups Name Std.Dev.
department experience 2082
Residual 9438
Number of obs: 125, groups: department, 5
Fixed Effects:
(Intercept) experience
57705 1249
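As with m1, the fitting call for m2 isn’t shown; matching the Formula line in the output above, a sketch (random_slope_preds is the column name used in the plotting code):

```r
# Varying slope model: departments share a common intercept but get their
# own experience slopes; (0 + experience | ...) suppresses a random intercept
m2 <- lmer(salary ~ experience + (0 + experience | department), data = facsal)

facsal <- facsal |>
  mutate(random_slope_preds = predict(m2))
```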
m2 Coefficients

tidy(m2, conf.int = TRUE) |>
select(-std.error, -statistic) |>
gt() |> fmt_number(decimals = 1) |>
  tab_options(table.font.size = 20)

| effect | group | term | estimate | conf.low | conf.high |
|---|---|---|---|---|---|
| fixed | NA | (Intercept) | 57,704.7 | 54,616.9 | 60,792.5 |
| fixed | NA | experience | 1,249.3 | −661.1 | 3,159.6 |
| ran_pars | department | sd__experience | 2,081.6 | NA | NA |
| ran_pars | Residual | sd__Observation | 9,437.9 | NA | NA |
m2 predictions

m2 predictions without the data

m2 predictions and the data

ggplot(data=facsal, aes(x=experience,
y=random_slope_preds,
group = department,
col = department)) +
geom_line() +
geom_point(aes(x=experience, y=salary,
group = department, colour = department)) +
labs(x="Experience",y="Salary (in $)",
title = "Varying Slope Salary Prediction") +
  scale_color_discrete('Department')

m2 predictions and the data, faceted by Department

ggplot(data=facsal, aes(x=experience,
y=random_slope_preds,
group = department,
col = department)) +
geom_line() +
geom_point(aes(x=experience, y=salary,
group = department, colour = department)) +
labs(x="Experience",y="Salary (in $)",
title = "Varying Slope Salary Prediction") +
scale_color_discrete('Department') +
  facet_wrap(~ department)

m2 Residuals by Department

Linear mixed model fit by REML ['lmerMod']
Formula: salary ~ experience + (1 + experience | department)
Data: facsal
REML criterion at convergence: 2405.105
Random effects:
Groups Name Std.Dev. Corr
department (Intercept) 16320.1
experience 722.4 -0.64
Residual 3569.5
Number of obs: 125, groups: department, 5
Fixed Effects:
(Intercept) experience
59083 1165
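Again matching the Formula line in the output above, the m3 fit could be sketched as (random_slope_int_preds is the column name used in the plotting code):

```r
# Varying intercepts AND varying slopes, with an estimated correlation
# between the department-level intercepts and slopes
m3 <- lmer(salary ~ experience + (1 + experience | department), data = facsal)

facsal <- facsal |>
  mutate(random_slope_int_preds = predict(m3))
```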
m3 Coefficients

| effect | group | term | estimate | std.error | statistic |
|---|---|---|---|---|---|
| fixed | NA | (Intercept) | 59,083.0 | 7,325.8 | 8.1 |
| fixed | NA | experience | 1,164.9 | 342.3 | 3.4 |
| ran_pars | department | sd__(Intercept) | 16,320.1 | NA | NA |
| ran_pars | department | cor__(Intercept).experience | −0.6 | NA | NA |
| ran_pars | department | sd__experience | 722.4 | NA | NA |
| ran_pars | Residual | sd__Observation | 3,569.5 | NA | NA |
m3 predictions

m3 predictions without the data

m3 predictions and the data

ggplot(data=facsal, aes(x=experience,
y=random_slope_int_preds,
group = department,
col = department)) +
geom_line() +
geom_point(aes(x=experience, y=salary,
group = department, colour = department)) +
labs(x="Experience",y="Salary (in $)",
title = "Model m3 Salary Prediction") +
  scale_color_discrete('Department')

m3 predictions and the data, faceted by Department

ggplot(data=facsal, aes(x=experience,
y=random_slope_int_preds,
group = department,
col = department)) +
geom_line() +
geom_point(aes(x=experience, y=salary,
group = department, colour = department)) +
labs(x="Experience",y="Salary (in $)",
title = "Model m3 Salary Prediction") +
scale_color_discrete('Department') +
  facet_wrap(~ department)

m3 Residuals by Department

Let’s refit model m3 and compare it to an appropriate null model (without the experience information), using an anova-driven likelihood ratio test.
The REML = FALSE lets us get the likelihood ratio test we want.
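A sketch of how this comparison could be set up, refitting both models with maximum likelihood (model names here match the anova output):

```r
# Refit with maximum likelihood (REML = FALSE) so the likelihood ratio
# test comparing models with different fixed effects is appropriate
m_null <- lmer(salary ~ (1 | department), data = facsal, REML = FALSE)
m3 <- lmer(salary ~ experience + (1 + experience | department),
           data = facsal, REML = FALSE)

anova(m_null, m3)
```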
m3 to m_null

Data: facsal
Models:
m_null: salary ~ (1 | department)
m3: salary ~ experience + (1 + experience | department)
npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
m_null 3 2527.7 2536.2 -1260.9 2521.7
m3 6 2449.5 2466.5 -1218.8 2437.5 84.196 3 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
m3

tidy(m3, conf.int = TRUE) |>
select(-std.error, -statistic) |>
gt() |> fmt_number(decimals = 1) |>
  tab_options(table.font.size = 20)

| effect | group | term | estimate | conf.low | conf.high |
|---|---|---|---|---|---|
| fixed | NA | (Intercept) | 59,078.1 | 46,212.3 | 71,943.9 |
| fixed | NA | experience | 1,165.7 | 565.6 | 1,765.9 |
| ran_pars | department | sd__(Intercept) | 14,610.5 | NA | NA |
| ran_pars | department | cor__(Intercept).experience | −0.6 | NA | NA |
| ran_pars | department | sd__experience | 636.3 | NA | NA |
| ran_pars | Residual | sd__Observation | 3,569.5 | NA | NA |
nBoot=100 # should probably be 1000 at a minimum
lrStat=rep(NA,nBoot)
# first fit appropriate null and alternate models
ft.null <- lm(salary ~ experience, data = facsal) #null model
ft.alt <- lmer(salary ~ experience + (1 | department),
data=facsal, REML=F) # alternate model
# calculate observed test statistic (deviance = -2 * loglik)
lrObs <- 2*logLik(ft.alt) - 2*logLik(ft.null) # test stat

set.seed(432)
for(iBoot in 1:nBoot)
{
facsal$SalSim=unlist(simulate(ft.null)) #resampled data
# calculate results for our two models in resampled data
bNull <- lm(SalSim ~ experience,
data=facsal) #null model
bAlt <- lmer(SalSim ~ experience + (1|department),
data=facsal, REML=F) # alternate model
# calculate and store resampled test stat
lrStat[iBoot] <- 2*logLik(bAlt) - 2*logLik(bNull)
}

Our parametric bootstrap can run up against a problem at the boundary of the random-effects parameter space.
boundary (singular) fit: see ?isSingular
is a common warning we might see, for instance.
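Once the loop finishes, the observed statistic can be compared to the resampled ones to estimate a p-value. A common convention adds 1 to both the numerator and denominator so the estimate is never exactly zero:

```r
# Proportion of resampled test statistics at least as large as observed
p_boot <- (sum(lrStat >= lrObs) + 1) / (nBoot + 1)
p_boot
```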
A model for an outcome that incorporates both fixed and random effects.
Or, alternatively,…
Mixed models are those with a mixture of fixed and random effects. Random effects are categorical factors where the levels have been selected from many possible levels and the investigator would like to make inferences beyond just the levels chosen.
A random factor:
Think of a random factor as a group where:
Sources: https://bbolker.github.io/morelia_2018/notes/glmm.html and http://environmentalcomputing.net/mixed-models-1/
Source: Crawley (2002) and Gelman (2005) quoted at https://bbolker.github.io/morelia_2018/notes/glmm.html
The one I most often use is something like:
The various definitions in the literature are incompatible with each other.
From Schabenberger and Pierce (2001), we have this gem:
One modeler’s random effect is another modeler’s fixed effect.
A more practical definition might be to ask the question posed by Crawley (2002):
Are there enough levels of the factor in the data on which to base an estimate of the variance of the population of effects? If no, [you should probably treat the variable as] fixed effects.
Suppose we have an outcome y, a predictor x, and a grouping variable group.
- y ~ x = linear regression on x: not a mixed model
- y ~ 1 + (1 | group) = random intercept on group: null model
- y ~ x + (1 | group) = fixed slope and random intercept
- y ~ (0 + x | group) = random slope of x within group, no variation in intercept
- y ~ x + (x | group) = random intercept and random slope

The most common example in modern medicine has measurements nested within people. Repeated measures and longitudinal data provide typical settings for this sort of approach.
Another setting where a hierarchical approach is of interest occurs when you have variables measured at multiple levels, for instance you have information on patients, who are nested within providers, who are nested within hospitals.
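In lme4 formula syntax, that kind of nesting can be expressed with the / operator. The variable names here are hypothetical, chosen only to match the patients-within-providers-within-hospitals example:

```r
# Hypothetical nested structure: patients within providers within hospitals.
# (1 | hospital/provider) is shorthand for
# (1 | hospital) + (1 | hospital:provider)
m_nested <- lmer(outcome ~ x + (1 | hospital/provider), data = patient_data)
```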
None of what I’ve talked about today should be taken as the final word on how to extend these ideas beyond the very simple example I’ve provided this afternoon.
Survival (time to event) regression models, specifically Cox proportional hazards models.
432 Class 20 | 2024-03-28 | https://thomaselove.github.io/432-2024/